# pandas 🐼

(Víctor Sojo | vsojo@amnh.org)
In this notebook we will learn how to use pandas, a mighty data-analysis tool for Python. The main type of object in pandas is the DataFrame, which is a type of dynamic table (similar to an Excel spreadsheet) in which you can access, filter, sort and manipulate columns and rows in extremely efficient and powerful ways. If you have ever worked in R, pandas' DataFrames work very similarly to the ones in R (in fact, in the following notebook we will learn that we can easily interface R and Python within Jupyter).
References:

- The official pandas website and its general documentation, including the list of specific beginner tutorials.

I strongly recommend those official resources.

Contents:

- Getting started
- Creating a pandas.DataFrame
- Exploring the size(s) of a dataframe
- Accessing columns in a dataframe
- Arithmetic operations across entire columns
- Adding columns to a dataframe
- Deleting or "dropping" columns
- Renaming columns
- Applying functions on columns
- Getting the data types of columns
- Boolean operations and filtering/subsetting the rows in a dataframe
- Using .loc[] and .iloc[] to filter and explore rows and columns
- Using .loc (or .iloc) to replace values in a dataframe
- Writing and reading data to/from text files
- Loading in a more realistic dataset
- Using pivot to transform a stacked dataframe into a multi-indexed one
- Flattening a multi-index dataframe
- Resetting the index of a dataframe
- Getting more than a sample of rows
- Summary statistics
- Merging dataframes
- Comparing two columns
- Replacing specific values in a dataframe with .replace()
- Computing aggregate statistics per group with .groupby()
- Sorting columns
- Plotting pandas data
- Sending pandas dataframes to R and vice versa
In case you haven't done so already, let's first create a conda environment specifically for data analysis. Open a new terminal or Anaconda prompt window and do the following:
conda update --all -y
conda create -n data python=3.9 jupyter pandas matplotlib plotly scikit-learn rpy2 r-essentials -y
(You will recognise some of those packages already. I will explain what the others do as we use them.)
Once you've done that, if you've got this notebook open in Jupyter, shut Jupyter down, then activate the data environment and relaunch:

conda activate data
jupyter notebook
If everything went well, the following should print out "data":
! echo $CONDA_DEFAULT_ENV
data
⚠️ If you're on Windows, remember that every line starting with a !, such as ! some code should be changed to !wsl some code, and you need to have an active WSL installation.
import pandas as pd
Note that we're giving pandas an alias of pd. This is completely unnecessary but everybody does it because it makes the code shorter and arguably easier to read.
You will notice below that we're also using matplotlib and plotly, but we didn't import them here (or anywhere else in this notebook). This works because we will be using pandas internal methods to plot directly, so pandas will kindly take care of any imports it needs as they become necessary.
## Creating a pandas.DataFrame

There are many ways to create a pandas.DataFrame from scratch. One of the easiest is with a dictionary, wherein each key is a column name and each value is a list of row values for that column.
In the example below we define two columns:
- 'threes' contains a few multiples of 3.
- 'squares' contains the squares of those numbers.

df = pd.DataFrame(
{
'threes' : [0, 3, 6, 9, 12, 15],
'squares' : [0, 9, 36, 81, 144, 225]
}
)
print(df)
   threes  squares
0       0        0
1       3        9
2       6       36
3       9       81
4      12      144
5      15      225
As you can see, pandas adds an extra index column at the left, which, in traditional Python fashion, starts counting at 0 and keeps going up one by one (just like the indices in a list or in a numpy.array).
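Incidentally, you don't have to accept that default index: the DataFrame constructor takes an index argument with your own row labels. A quick sketch with made-up labels:

```python
import pandas as pd

# Same idea as above, but with our own row labels instead of 0, 1, 2, ...
df_labeled = pd.DataFrame(
    {
        'threes':  [0, 3, 6],
        'squares': [0, 9, 36]
    },
    index=['a', 'b', 'c']  # custom index labels
)
print(df_labeled)
```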
You may remember two things from our previous lessons on Jupyter:

- Jupyter automatically displays the result of the last line of a cell.
- You can use print() to explicitly print anything to screen.

I will backtrack from that recommendation in the case of pandas, because pandas dataframes are printed much more nicely in Jupyter if you just write their name in the last line, without print():

# don't use print() to print out pandas dataframes, they look much nicer without:
df
| | threes | squares |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 3 | 9 |
| 2 | 6 | 36 |
| 3 | 9 | 81 |
| 4 | 12 | 144 |
| 5 | 15 | 225 |
## Many ways to create pandas dataframes!

There must be at least a dozen ways to create a pandas dataframe. Above we used a basic dictionary method in which we have column_name : [values, in, the, column], and below we will read in data from a file. In my view these two are by far the most common and intuitive, but there are many other ways, described in the pandas documentation. Here's a nice unofficial tutorial.
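One of those other ways, as a quick sketch: a list of dictionaries, one per row, where the keys become the column names.

```python
import pandas as pd

# One dictionary per row; keys become column names
records = [
    {'threes': 0, 'squares': 0},
    {'threes': 3, 'squares': 9},
    {'threes': 6, 'squares': 36},
]
df_from_records = pd.DataFrame(records)
print(df_from_records)
```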
## Exploring the size(s) of a dataframe

df.shape
(6, 2)
From the shape, we can trivially extract the number of rows (df.shape[0]) and the number of columns (df.shape[1]); but if you want to be a bit safer, you can try the methods below.
len(df)
6
So, the built-in Python function len() returns the number of records or entries in the dataframe, i.e., the number of rows.
len(df.columns)
2
And with that, let's look a little closer into how to access columns in a dataframe.
The two main ways to extract the values of a column are:
- df.mycolumn
- df['mycolumn']

### object.attribute notation

df.threes
0     0
1     3
2     6
3     9
4    12
5    15
Name: threes, dtype: int64
### dictionary['key'] notation

df['threes']
0     0
1     3
2     6
3     9
4    12
5    15
Name: threes, dtype: int64
The advantage of object notation is that it looks really neat and is therefore easy to read and quick to write. Conversely, dictionary notation is clunky to write, but it has the advantages that the key can have spaces in its name, and also you can pass the column name (the key) as a variable:
my_column = 'threes'
df[my_column]
0     0
1     3
2     6
3     9
4    12
5    15
Name: threes, dtype: int64
Note: dictionary notation is more universal and robust, i.e. it works in most cases, whereas attribute notation does not (e.g. for creating or deleting columns, you must use dict notation).
By the way, you may notice that the last line tells us the name of the column (threes) but it also tells us that its dtype is int64, an integer of 64 bits. We study data types in more detail in the numpy lesson; in brief: wherever possible, pandas leverages numpy to optimise the contents of dataframes. This means that, because all elements in the threes column are integers, pandas can store them much more efficiently in memory.
There are several ways to select multiple columns at once. A good one is to use double brackets, with the desired columns separated by commas inside the inner brackets:
df[['threes', 'squares']]
| | threes | squares |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 3 | 9 |
| 2 | 6 | 36 |
| 3 | 9 | 81 |
| 4 | 12 | 144 |
| 5 | 15 | 225 |
In this case, those were our only two columns, so we just get the entire dataframe again, but you can choose whichever columns you wish, in any order. For example, try putting 'squares' first and 'threes' second.
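For example, the swap suggested above (the inner list controls both which columns you get and their order):

```python
import pandas as pd

df_small = pd.DataFrame({'threes': [0, 3, 6], 'squares': [0, 9, 36]})

# The order of the inner list is the order of the returned columns
swapped = df_small[['squares', 'threes']]
print(list(swapped.columns))  # ['squares', 'threes']
```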
We can do basic operations on each of the elements of a column very efficiently:
df.threes * 2
0     0
1     6
2    12
3    18
4    24
5    30
Name: threes, dtype: int64
And we can also do operations side-by-side between columns. For example, we can multiply the column with the multiples of 3 by the column that has their squares, to get the cubes:
df.threes * df.squares
0       0
1      27
2     216
3     729
4    1728
5    3375
dtype: int64
Our dataframe remains unaltered though:
df
| | threes | squares |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 3 | 9 |
| 2 | 6 | 36 |
| 3 | 9 | 81 |
| 4 | 12 | 144 |
| 5 | 15 | 225 |
So let's see how to keep the result in a new column if we wish.
For this, you just specify the name of the new column and tell pandas what you want it to contain:
# Add a column with the cubes of the multiples of 3
df['cubes'] = df.threes * df.squares # Using dict notation for the new column but attribute notation for existing values
# and another one with the square roots
df['sqroots'] = df['threes'] ** 0.5 # Using dict notation in both cases
# Print the modified dataframe
df
| | threes | squares | cubes | sqroots |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0.000000 |
| 1 | 3 | 9 | 27 | 1.732051 |
| 2 | 6 | 36 | 216 | 2.449490 |
| 3 | 9 | 81 | 729 | 3.000000 |
| 4 | 12 | 144 | 1728 | 3.464102 |
| 5 | 15 | 225 | 3375 | 3.872983 |
⚠️ Note that you cannot use object.attribute notation for the name of the new column; you must use dictionary['key'] notation. However, for the values used in the calculation, you can use object notation if you wish (as we did above for the first calculation) or dict notation (as we did for the second).
Columns don't have to be calculated from other columns; you can just add columns with whatever content you wish, as long as the list you provide is as long as the number of rows in the dataframe:
df['names'] = ['zero', 'three', 'six', 'nine', 'twelve', 'fifteen']
df
| | threes | squares | cubes | sqroots | names |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0.000000 | zero |
| 1 | 3 | 9 | 27 | 1.732051 | three |
| 2 | 6 | 36 | 216 | 2.449490 | six |
| 3 | 9 | 81 | 729 | 3.000000 | nine |
| 4 | 12 | 144 | 1728 | 3.464102 | twelve |
| 5 | 15 | 225 | 3375 | 3.872983 | fifteen |
⚠️ Just note that the length of the new column must be the same as the number of rows in the dataframe.
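If the lengths don't match, pandas refuses with a ValueError. A quick sketch (with a cut-down, made-up dataframe):

```python
import pandas as pd

df_short = pd.DataFrame({'threes': [0, 3, 6]})

# Three rows, but only two values: pandas raises a ValueError
try:
    df_short['names'] = ['zero', 'three']
except ValueError as err:
    print('pandas refused:', err)
```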
## Deleting or "dropping" columns

This is trivially done by either of two methods. One is using Python's del keyword:
del df['squares']
df
| | threes | cubes | sqroots | names |
|---|---|---|---|---|
| 0 | 0 | 0 | 0.000000 | zero |
| 1 | 3 | 27 | 1.732051 | three |
| 2 | 6 | 216 | 2.449490 | six |
| 3 | 9 | 729 | 3.000000 | nine |
| 4 | 12 | 1728 | 3.464102 | twelve |
| 5 | 15 | 3375 | 3.872983 | fifteen |
Just like for adding columns, you must use dictionary notation to remove a column with del, i.e., if you had instead tried:
del df.squares
you would have gotten an error.
The other method to remove a column is to "drop" it by using the pandas method .drop(columns=[col1, col2, ...]):
df.drop(columns=['cubes'], inplace=True)
df
| | threes | sqroots | names |
|---|---|---|---|
| 0 | 0 | 0.000000 | zero |
| 1 | 3 | 1.732051 | three |
| 2 | 6 | 2.449490 | six |
| 3 | 9 | 3.000000 | nine |
| 4 | 12 | 3.464102 | twelve |
| 5 | 15 | 3.872983 | fifteen |
⚠️ We need to specify inplace=True so that the change is made to the dataframe itself, as opposed to returning a changed copy of it (the default behaviour).
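To see the difference, here's a small sketch of the default (non-inplace) behaviour, using a made-up three-column dataframe:

```python
import pandas as pd

df_full = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Without inplace=True, .drop() returns a modified copy...
dropped = df_full.drop(columns=['b'])

# ...and the original dataframe keeps all of its columns
print(list(df_full.columns))  # ['a', 'b', 'c']
print(list(dropped.columns))  # ['a', 'c']
```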
To rename column(s), pandas uses a dictionary syntax, in which the key is given as old_name and the value is the new_name. For example, to rename both the sqroots and names columns:
df.rename( columns={'sqroots' : 'sqrt', 'names' : 'number name'} , inplace=True )
df
| | threes | sqrt | number name |
|---|---|---|---|
| 0 | 0 | 0.000000 | zero |
| 1 | 3 | 1.732051 | three |
| 2 | 6 | 2.449490 | six |
| 3 | 9 | 3.000000 | nine |
| 4 | 12 | 3.464102 | twelve |
| 5 | 15 | 3.872983 | fifteen |
Just like for dropping a column, we need inplace=True.
⚠️ Note that we introduced a space in number name. This means we have lost the ability to use .dot notation with that column, so, in general: avoid using spaces in your column names.
## Applying functions on columns

Some functions can be applied to an entire column. For example, we can sum all the values in a column:
sum(df.threes)
45
Pandas also has its own summary statistics methods, such as .median() and .std(), which we will study later.
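A quick taste of those built-in methods, on a Series with the same values as our threes column:

```python
import pandas as pd

threes = pd.Series([0, 3, 6, 9, 12, 15], name='threes')

print(threes.sum())     # 45
print(threes.mean())    # 7.5
print(threes.median())  # 7.5
```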
As we saw above, we can apply some basic mathematical operations to each of the values:
df.threes**2
0      0
1      9
2     36
3     81
4    144
5    225
Name: threes, dtype: int64
... but what if we want to apply a more complicated function? We can use the .apply() method to apply any function we wish to every element in a column. This can be a pre-existing python function, or we can define our own function:
def doublenegative(val):
"""Returns the negative of the double of val (a simple number)"""
return val * (-2)
df['doublenegs'] = df.threes.apply(doublenegative)
df
| | threes | sqrt | number name | doublenegs |
|---|---|---|---|---|
| 0 | 0 | 0.000000 | zero | 0 |
| 1 | 3 | 1.732051 | three | -6 |
| 2 | 6 | 2.449490 | six | -12 |
| 3 | 9 | 3.000000 | nine | -18 |
| 4 | 12 | 3.464102 | twelve | -24 |
| 5 | 15 | 3.872983 | fifteen | -30 |
Fancier still, we can use Python's lambda functions to "define" a function on the fly; i.e., we don't have to create the function beforehand. For example, let's calculate the triple negative of the values:
df['triplenegs'] = df.threes.apply(lambda x: x * (-3))
df
| | threes | sqrt | number name | doublenegs | triplenegs |
|---|---|---|---|---|---|
| 0 | 0 | 0.000000 | zero | 0 | 0 |
| 1 | 3 | 1.732051 | three | -6 | -9 |
| 2 | 6 | 2.449490 | six | -12 | -18 |
| 3 | 9 | 3.000000 | nine | -18 | -27 |
| 4 | 12 | 3.464102 | twelve | -24 | -36 |
| 5 | 15 | 3.872983 | fifteen | -30 | -45 |
A full understanding of the use of lambda above is beyond the scope of this tutorial, but I encourage you to look into it.
## Getting the data types of columns

Easy:
df.dtypes
threes           int64
sqrt           float64
number name     object
doublenegs       int64
triplenegs       int64
dtype: object
Note: strings (i.e. text) are generically treated as an "object" in pandas, as is pretty much anything that isn't a number or Boolean (True/False).
One way to change the data type of a column is to reassign it to itself with a different type:
df['triplenegs'] = df['triplenegs'].astype('float32')
df.dtypes
threes           int64
sqrt           float64
number name     object
doublenegs       int64
triplenegs     float32
dtype: object
The type of triplenegs has changed from int64 to float32. And if we now print the dataframe again:
df
| | threes | sqrt | number name | doublenegs | triplenegs |
|---|---|---|---|---|---|
| 0 | 0 | 0.000000 | zero | 0 | 0.0 |
| 1 | 3 | 1.732051 | three | -6 | -9.0 |
| 2 | 6 | 2.449490 | six | -12 | -18.0 |
| 3 | 9 | 3.000000 | nine | -18 | -27.0 |
| 4 | 12 | 3.464102 | twelve | -24 | -36.0 |
| 5 | 15 | 3.872983 | fifteen | -30 | -45.0 |
... we see that the triplenegs column has indeed changed from integer to float: it now has a decimal; it's zero in all cases, but it's there!
## Boolean operations and filtering/subsetting the rows in a dataframe

Row filtering requires a bit of Boolean wizardry: you need to tell pandas which conditions you want your data to match. For example, to find out which of the multiples of three are also even, we ask pandas which of those numbers give a remainder (%) of 0 when divided by 2:
df.threes % 2 == 0
0     True
1    False
2     True
3    False
4     True
5    False
Name: threes, dtype: bool
And as you can see, the result is a Series of Booleans (dtype bool).
### df[ filter ]

Now that we know how to make a Boolean filter, we can use it to ask pandas to show us only the elements that match that filter. In our example, to get the rows for which the threes column is even, we use the same filter as above and feed it to a [ ] pair for subsetting:
df[ df.threes % 2 == 0 ]
| | threes | sqrt | number name | doublenegs | triplenegs |
|---|---|---|---|---|---|
| 0 | 0 | 0.000000 | zero | 0 | 0.0 |
| 2 | 6 | 2.449490 | six | -12 | -18.0 |
| 4 | 12 | 3.464102 | twelve | -24 | -36.0 |
(the spaces inside the brackets are unnecessary; I just put them there for clarity)
This may seem a little confusing at first, particularly because of the double df outside and inside the brackets. What's happening here is that the inner code creates the same list of Trues and Falses that we got above; we then send this list to df, so that pandas can decide what to show us from the dataframe: it shows us the rows that get a True, and skips those that get a False.
In fact, since we have six rows, we could use any combination of six True/False values:
df[ [True, True, False, False, True, False] ]
| | threes | sqrt | number name | doublenegs | triplenegs |
|---|---|---|---|---|---|
| 0 | 0 | 0.000000 | zero | 0 | 0.0 |
| 1 | 3 | 1.732051 | three | -6 | -9.0 |
| 4 | 12 | 3.464102 | twelve | -24 | -36.0 |
### Multiple conditions with ( ) | ( ) and ( ) & ( )

Let's get all rows for which either the threes column is even, or the value in the number name column contains the text "een":
df[ (df.threes % 2 == 0) | (["een" in name for name in df['number name'].values]) ]
| | threes | sqrt | number name | doublenegs | triplenegs |
|---|---|---|---|---|---|
| 0 | 0 | 0.000000 | zero | 0 | 0.0 |
| 2 | 6 | 2.449490 | six | -12 | -18.0 |
| 4 | 12 | 3.464102 | twelve | -24 | -36.0 |
| 5 | 15 | 3.872983 | fifteen | -30 | -45.0 |
(remember that we can't use .dot notation with the number name column anymore because now it contains a space)
We used a list comprehension in the second condition above. Just in case you can't remember from the basic Python workshop how list comprehensions work, here is one to calculate the squares of the numbers from 0 to 9:
[x**2 for x in range(10)]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
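Incidentally, pandas has a vectorised alternative to that list comprehension: the .str accessor on string columns. A sketch with the same names as in our dataframe:

```python
import pandas as pd

names = pd.Series(['zero', 'three', 'six', 'nine', 'twelve', 'fifteen'])

# True wherever the string contains 'een' -- same result as the comprehension
mask = names.str.contains('een')
print(mask.tolist())  # [False, False, False, False, False, True]
```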
## Using .loc[] and .iloc[] to filter and explore rows and columns

There are several ways to filter rows. Above we used simply:
df[ df.threes % 2 == 0 ]
But this has limitations, and there are other ways worth exploring. One of the most popular and powerful is .loc[ ].
### Using .loc to access elements in a dataframe

If we change the first df in the code above to df.loc, we get what seems to be the same result:
df.loc[ df.threes % 2 == 0 ]
| | threes | sqrt | number name | doublenegs | triplenegs |
|---|---|---|---|---|---|
| 0 | 0 | 0.000000 | zero | 0 | 0.0 |
| 2 | 6 | 2.449490 | six | -12 | -18.0 |
| 4 | 12 | 3.464102 | twelve | -24 | -36.0 |
You may wonder why you would ever want to write df.loc when df alone suffices. The main advantage of df.loc[] over df[] is that df.loc[] gives you access to the dataframe itself, so you can make changes to it (as we will see soon), whereas with df[ ] notation you only get a copy of the data, so you can't make any permanent changes to it.
.loc can filter columns too. All we need to do is separate the row filter from the column filter using a comma. We must then specify which columns we want, which could simply be by name:
df.loc[ df.threes > 5 , ['threes', 'sqrt']]
| | threes | sqrt |
|---|---|---|
| 2 | 6 | 2.449490 |
| 3 | 9 | 3.000000 |
| 4 | 12 | 3.464102 |
| 5 | 15 | 3.872983 |
Or we could use separate Boolean filters for the row values and column names:
df.loc[ df.threes > 5 , ["leneg" in colname for colname in df.columns] ]
| | doublenegs | triplenegs |
|---|---|---|
| 2 | -12 | -18.0 |
| 3 | -18 | -27.0 |
| 4 | -24 | -36.0 |
| 5 | -30 | -45.0 |
Here we used a list comprehension to tell pandas to get us any column that has the text "leneg" in its name. You can use any valid filter you can come up with.
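For this particular job (columns whose name contains some text), pandas also ships a shortcut: .filter(), which on a dataframe selects columns by default. A sketch with a cut-down, made-up dataframe:

```python
import pandas as pd

df_negs = pd.DataFrame({'threes': [3], 'doublenegs': [-6], 'triplenegs': [-9]})

# Keep only the columns whose name contains 'leneg'
subset = df_negs.filter(like='leneg')
print(list(subset.columns))  # ['doublenegs', 'triplenegs']
```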
Ultimately, we must give either the specific names of the columns, or lists of Trues and Falses for both the rows and the columns:
df.loc[ [True, True, False, False, True, True], [False, True, False, True, True ]]
| | sqrt | doublenegs | triplenegs |
|---|---|---|---|
| 0 | 0.000000 | 0 | 0.0 |
| 1 | 1.732051 | -6 | -9.0 |
| 4 | 3.464102 | -24 | -36.0 |
| 5 | 3.872983 | -30 | -45.0 |
⚠️ Note that the rows list must be 6 items long because we have six rows, and the columns list must be 5 items long because that's the number of columns in this dataframe at present. Any other lengths will throw an error.
The .iloc[] property gives us access to the rows of the dataframe by numbered position, starting at 0:
df.iloc[1]
threes                3
sqrt           1.732051
number name       three
doublenegs           -6
triplenegs         -9.0
Name: 1, dtype: object
We can also give a range, which works as customarily in Python:
df.iloc[1:4]
| | threes | sqrt | number name | doublenegs | triplenegs |
|---|---|---|---|---|---|
| 1 | 3 | 1.732051 | three | -6 | -9.0 |
| 2 | 6 | 2.449490 | six | -12 | -18.0 |
| 3 | 9 | 3.000000 | nine | -18 | -27.0 |
And we can also access the columns in the same way:
df.iloc[ 1:4, 2:5 ]
| | number name | doublenegs | triplenegs |
|---|---|---|---|
| 1 | three | -6 | -9.0 |
| 2 | six | -12 | -18.0 |
| 3 | nine | -18 | -27.0 |
We can take another look at the raw dataframe to make sure that it worked well:
df
| | threes | sqrt | number name | doublenegs | triplenegs |
|---|---|---|---|---|---|
| 0 | 0 | 0.000000 | zero | 0 | 0.0 |
| 1 | 3 | 1.732051 | three | -6 | -9.0 |
| 2 | 6 | 2.449490 | six | -12 | -18.0 |
| 3 | 9 | 3.000000 | nine | -18 | -27.0 |
| 4 | 12 | 3.464102 | twelve | -24 | -36.0 |
| 5 | 15 | 3.872983 | fifteen | -30 | -45.0 |
### Using .loc (or .iloc) to replace values in a dataframe

Let's create a new column called is_even, and define it as False for starters:
df['is_even'] = False
df
| | threes | sqrt | number name | doublenegs | triplenegs | is_even |
|---|---|---|---|---|---|---|
| 0 | 0 | 0.000000 | zero | 0 | 0.0 | False |
| 1 | 3 | 1.732051 | three | -6 | -9.0 | False |
| 2 | 6 | 2.449490 | six | -12 | -18.0 | False |
| 3 | 9 | 3.000000 | nine | -18 | -27.0 | False |
| 4 | 12 | 3.464102 | twelve | -24 | -36.0 | False |
| 5 | 15 | 3.872983 | fifteen | -30 | -45.0 | False |
Here's where .loc shows its powers. If we tried to use df[ condition ][column] to change the value, the change would not be permanent, because df[ ] only produces a copy of the data; it doesn't give access to the dataframe itself. In contrast, .loc both returns what we're looking for and gives us access to it, so we can use it to change values:
df.loc[df.threes % 2 == 0, 'is_even'] = True
df
| | threes | sqrt | number name | doublenegs | triplenegs | is_even |
|---|---|---|---|---|---|---|
| 0 | 0 | 0.000000 | zero | 0 | 0.0 | True |
| 1 | 3 | 1.732051 | three | -6 | -9.0 | False |
| 2 | 6 | 2.449490 | six | -12 | -18.0 | True |
| 3 | 9 | 3.000000 | nine | -18 | -27.0 | False |
| 4 | 12 | 3.464102 | twelve | -24 | -36.0 | True |
| 5 | 15 | 3.872983 | fifteen | -30 | -45.0 | False |
Note that .iloc functions in a similar way to .loc, i.e. you can use it to change values as well as just look at them.
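For a column that is defined entirely by a condition, you can also skip the two-step dance and assign the Boolean Series directly. A sketch with a made-up dataframe:

```python
import pandas as pd

df_even = pd.DataFrame({'threes': [0, 3, 6, 9]})

# The condition already yields True/False per row, so assign it as-is
df_even['is_even'] = df_even.threes % 2 == 0
print(df_even.is_even.tolist())  # [True, False, True, False]
```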
## Writing data to a .csv or .tsv file

This is trivially done. The default is comma-separated values (CSV), but if you want tab-separated (TSV) instead, or something else, you can just specify a separator with sep:
df.to_csv('my_pandas_df.tsv', sep='\t')
## Reading data from a .csv or .tsv file

Reading data in is just as easy as writing. Note that read_csv() belongs to pandas itself, not to a dataframe, so we call it on pd and assign the result:

df = pd.read_csv('the_file.tsv', sep='\t')
You don't need to specify a separator if the data is in CSV.
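One detail worth knowing: by default, .to_csv() also writes the index as an extra first column. A round-trip sketch that avoids the stray index column with index=False (the file name here is made up):

```python
import os
import tempfile

import pandas as pd

df_io = pd.DataFrame({'threes': [0, 3, 6], 'squares': [0, 9, 36]})

# Write without the index, read back, and check we get the same data
path = os.path.join(tempfile.mkdtemp(), 'roundtrip.tsv')
df_io.to_csv(path, sep='\t', index=False)
df_back = pd.read_csv(path, sep='\t')
print(df_io.equals(df_back))  # True
```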
## Loading in a more realistic dataset

Since we're already familiar with many of the basics of pandas, let's load in a richer dataset (which you hopefully downloaded from my GitHub along with this notebook). This dataset contains a few details on phone and internet connectivity per country that I downloaded from the World Bank. I did a bit of pre-cleaning in Excel and now this is what we have:
wbdf = pd.read_csv('Data/WB_Countries_Infrastructure_2015_2019.csv')
wbdf
| | Country Name | Country Code | Indicator | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | Fixed broadband subscriptions | 19000.0 | 19000.0 | 19000.0 | 19000.0 | 19000.0 |
| 1 | Aruba | ABW | Fixed telephone subscriptions | 35000.0 | 35000.0 | 35000.0 | 35000.0 | 35000.0 |
| 2 | Aruba | ABW | Mobile cellular subscriptions | 141000.0 | 141000.0 | 141000.0 | 141000.0 | 141000.0 |
| 3 | Afghanistan | AFG | Fixed broadband subscriptions | 7067.0 | 8801.0 | 16810.0 | 17491.0 | 19683.0 |
| 4 | Afghanistan | AFG | Fixed telephone subscriptions | 110000.0 | 114192.0 | 118769.0 | 127794.0 | 134636.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 646 | Zambia | ZMB | Fixed telephone subscriptions | 116165.0 | 101407.0 | 101444.0 | 100444.0 | 96719.0 |
| 647 | Zambia | ZMB | Mobile cellular subscriptions | 11557700.0 | 12017000.0 | 13438500.0 | 15470300.0 | 17220600.0 |
| 648 | Zimbabwe | ZWE | Fixed broadband subscriptions | 163987.0 | 170838.0 | 187310.0 | 203056.0 | 204424.0 |
| 649 | Zimbabwe | ZWE | Fixed telephone subscriptions | 333702.0 | 305720.0 | 264150.0 | 268849.0 | 265734.0 |
| 650 | Zimbabwe | ZWE | Mobile cellular subscriptions | 12757400.0 | 12878900.0 | 14092100.0 | 12909000.0 | 13195900.0 |
651 rows × 8 columns
Note that this data is in "stacked" or "record" format, as is most of the data from the UN, WHO, and World Bank. We need to transform it so that each country appears only once, with the indicator values for each year as columns.
## Using pivot to transform a stacked dataframe into a multi-indexed one

# Pivot the dataframe to unstack it:
wbdf = wbdf.pivot(index=['Country Code', 'Country Name'],
columns='Indicator',
values=['2015', '2016', '2017', '2018', '2019']
)
# And print it:
wbdf
| | | 2015 | | | 2016 | | | 2017 | | | 2018 | | | 2019 | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Indicator | | Fixed broadband subscriptions | Fixed telephone subscriptions | Mobile cellular subscriptions | Fixed broadband subscriptions | Fixed telephone subscriptions | Mobile cellular subscriptions | Fixed broadband subscriptions | Fixed telephone subscriptions | Mobile cellular subscriptions | Fixed broadband subscriptions | Fixed telephone subscriptions | Mobile cellular subscriptions | Fixed broadband subscriptions | Fixed telephone subscriptions | Mobile cellular subscriptions |
| Country Code | Country Name | |||||||||||||||
| ABW | Aruba | 19000.0 | 35000.0 | 141000.0 | 19000.0 | 35000.0 | 141000.0 | 19000.0 | 35000.0 | 141000.0 | 19000.0 | 35000.0 | 141000.0 | 19000.0 | 35000.0 | 141000.0 |
| AFG | Afghanistan | 7067.0 | 110000.0 | 19709000.0 | 8801.0 | 114192.0 | 21603000.0 | 16810.0 | 118769.0 | 23929700.0 | 17491.0 | 127794.0 | 21976400.0 | 19683.0 | 134636.0 | 22580100.0 |
| AGO | Angola | 153571.0 | 284925.0 | 13884500.0 | 123135.0 | 304493.0 | 13001100.0 | 96919.0 | 161070.0 | 13324000.0 | 109561.0 | 171858.0 | 13288400.0 | 119047.0 | 124726.0 | 14830200.0 |
| ALB | Albania | 242870.0 | 226718.0 | 3400960.0 | 266379.0 | 248640.0 | 3369760.0 | 303612.0 | 246720.0 | 3625700.0 | 360761.0 | 247428.0 | 2714880.0 | 436192.0 | 242859.0 | 2630080.0 |
| AND | Andorra | 30694.0 | 38850.0 | 71336.0 | 32490.0 | 38694.0 | 76132.0 | 34284.0 | 38411.0 | 80337.0 | 35663.0 | 39375.0 | 82614.0 | 36358.0 | 39657.0 | 87909.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| XKX | Kosovo | NaN | 106300.0 | 562000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 106300.0 | 562000.0 |
| YEM | Yemen, Rep. | 395000.0 | 1142230.0 | 15022000.0 | 430400.0 | 1165830.0 | 16433100.0 | 469000.0 | 1190000.0 | 15357900.0 | 469000.0 | 1190000.0 | 15357900.0 | 469000.0 | 1190000.0 | 15357900.0 |
| ZAF | South Africa | 1409350.0 | 4131060.0 | 87999500.0 | 1150770.0 | 4522850.0 | 82412900.0 | 1123190.0 | 4810070.0 | 88497600.0 | 1386840.0 | 3103750.0 | 88567000.0 | 1250360.0 | 2024730.0 | 96972500.0 |
| ZMB | Zambia | 23390.0 | 116165.0 | 11557700.0 | 31784.0 | 101407.0 | 12017000.0 | 35912.0 | 101444.0 | 13438500.0 | 43365.0 | 100444.0 | 15470300.0 | 88891.0 | 96719.0 | 17220600.0 |
| ZWE | Zimbabwe | 163987.0 | 333702.0 | 12757400.0 | 170838.0 | 305720.0 | 12878900.0 | 187310.0 | 264150.0 | 14092100.0 | 203056.0 | 268849.0 | 12909000.0 | 204424.0 | 265734.0 | 13195900.0 |
217 rows × 15 columns
This is very good. We now have a multi-indexed dataframe. We can use this multi-index to get only the values for 2015:
wbdf['2015']
| Indicator | | Fixed broadband subscriptions | Fixed telephone subscriptions | Mobile cellular subscriptions |
|---|---|---|---|---|
| Country Code | Country Name | |||
| ABW | Aruba | 19000.0 | 35000.0 | 141000.0 |
| AFG | Afghanistan | 7067.0 | 110000.0 | 19709000.0 |
| AGO | Angola | 153571.0 | 284925.0 | 13884500.0 |
| ALB | Albania | 242870.0 | 226718.0 | 3400960.0 |
| AND | Andorra | 30694.0 | 38850.0 | 71336.0 |
| ... | ... | ... | ... | ... |
| XKX | Kosovo | NaN | 106300.0 | 562000.0 |
| YEM | Yemen, Rep. | 395000.0 | 1142230.0 | 15022000.0 |
| ZAF | South Africa | 1409350.0 | 4131060.0 | 87999500.0 |
| ZMB | Zambia | 23390.0 | 116165.0 | 11557700.0 |
| ZWE | Zimbabwe | 163987.0 | 333702.0 | 12757400.0 |
217 rows × 3 columns
And if we want to get a column therein, all we have to do is filter twice:
wbdf['2015']['Fixed broadband subscriptions']
Country Code Country Name
ABW Aruba 19000.0
AFG Afghanistan 7067.0
AGO Angola 153571.0
ALB Albania 242870.0
AND Andorra 30694.0
...
XKX Kosovo NaN
YEM Yemen, Rep. 395000.0
ZAF South Africa 1409350.0
ZMB Zambia 23390.0
ZWE Zimbabwe 163987.0
Name: Fixed broadband subscriptions, Length: 217, dtype: float64
And further down the rabbit hole, if we want the values for a specific country, we filter thrice (using the country code, which is our main index):
wbdf['2015']['Fixed broadband subscriptions']['ZMB']
Country Name
Zambia    23390.0
Name: Fixed broadband subscriptions, dtype: float64
## Flattening a multi-index dataframe

The multi-index dataframe that we just built is very cool, but it may get a little annoying for some purposes. For example, we can't immediately access the country name as we normally would by issuing:
wbdf['Country Name']
...it throws an error. Similarly, trying to access any of the indicator columns also throws an error (we have to specify the year first). So, we'll need to flatten the dataframe to make it more like a normal pandas dataframe.
Let's first take a look at the names of the columns:
wbdf.columns
MultiIndex([('2015', 'Fixed broadband subscriptions'),
('2015', 'Fixed telephone subscriptions'),
('2015', 'Mobile cellular subscriptions'),
('2016', 'Fixed broadband subscriptions'),
('2016', 'Fixed telephone subscriptions'),
('2016', 'Mobile cellular subscriptions'),
('2017', 'Fixed broadband subscriptions'),
('2017', 'Fixed telephone subscriptions'),
('2017', 'Mobile cellular subscriptions'),
('2018', 'Fixed broadband subscriptions'),
('2018', 'Fixed telephone subscriptions'),
('2018', 'Mobile cellular subscriptions'),
('2019', 'Fixed broadband subscriptions'),
('2019', 'Fixed telephone subscriptions'),
('2019', 'Mobile cellular subscriptions')],
names=[None, 'Indicator'])
So the column labels are tuples. We need to flatten those, but we do want to keep the year, so let's do that with a for loop that combines the name of the indicator and the year:
for year, indicator in wbdf.columns:
print(year, indicator)
2015 Fixed broadband subscriptions
2015 Fixed telephone subscriptions
2015 Mobile cellular subscriptions
2016 Fixed broadband subscriptions
2016 Fixed telephone subscriptions
2016 Mobile cellular subscriptions
2017 Fixed broadband subscriptions
2017 Fixed telephone subscriptions
2017 Mobile cellular subscriptions
2018 Fixed broadband subscriptions
2018 Fixed telephone subscriptions
2018 Mobile cellular subscriptions
2019 Fixed broadband subscriptions
2019 Fixed telephone subscriptions
2019 Mobile cellular subscriptions
OK. Now, with some fancy rearranging and replacing of spaces with underscores (_) in the indicator names, we get:
for year, indicator in wbdf.columns:
print(f"{indicator.replace(' ', '_')}_{year}")
Fixed_broadband_subscriptions_2015
Fixed_telephone_subscriptions_2015
Mobile_cellular_subscriptions_2015
Fixed_broadband_subscriptions_2016
Fixed_telephone_subscriptions_2016
Mobile_cellular_subscriptions_2016
Fixed_broadband_subscriptions_2017
Fixed_telephone_subscriptions_2017
Mobile_cellular_subscriptions_2017
Fixed_broadband_subscriptions_2018
Fixed_telephone_subscriptions_2018
Mobile_cellular_subscriptions_2018
Fixed_broadband_subscriptions_2019
Fixed_telephone_subscriptions_2019
Mobile_cellular_subscriptions_2019
Good, now we can use this to feed a list of new column names. Actually, we can make it even fancier and use a list comprehension instead of the classic for loop:
[f"{indicator.replace(' ', '_')}_{year}" for year, indicator in wbdf.columns]
['Fixed_broadband_subscriptions_2015', 'Fixed_telephone_subscriptions_2015', 'Mobile_cellular_subscriptions_2015', 'Fixed_broadband_subscriptions_2016', 'Fixed_telephone_subscriptions_2016', 'Mobile_cellular_subscriptions_2016', 'Fixed_broadband_subscriptions_2017', 'Fixed_telephone_subscriptions_2017', 'Mobile_cellular_subscriptions_2017', 'Fixed_broadband_subscriptions_2018', 'Fixed_telephone_subscriptions_2018', 'Mobile_cellular_subscriptions_2018', 'Fixed_broadband_subscriptions_2019', 'Fixed_telephone_subscriptions_2019', 'Mobile_cellular_subscriptions_2019']
Now we can use this list to replace the column names in the dataframe
wbdf.columns = [f"{indicator.replace(' ', '_')}_{year}" for year, indicator in wbdf.columns]
wbdf
| Fixed_broadband_subscriptions_2015 | Fixed_telephone_subscriptions_2015 | Mobile_cellular_subscriptions_2015 | Fixed_broadband_subscriptions_2016 | Fixed_telephone_subscriptions_2016 | Mobile_cellular_subscriptions_2016 | Fixed_broadband_subscriptions_2017 | Fixed_telephone_subscriptions_2017 | Mobile_cellular_subscriptions_2017 | Fixed_broadband_subscriptions_2018 | Fixed_telephone_subscriptions_2018 | Mobile_cellular_subscriptions_2018 | Fixed_broadband_subscriptions_2019 | Fixed_telephone_subscriptions_2019 | Mobile_cellular_subscriptions_2019 | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Country Code | Country Name | |||||||||||||||
| ABW | Aruba | 19000.0 | 35000.0 | 141000.0 | 19000.0 | 35000.0 | 141000.0 | 19000.0 | 35000.0 | 141000.0 | 19000.0 | 35000.0 | 141000.0 | 19000.0 | 35000.0 | 141000.0 |
| AFG | Afghanistan | 7067.0 | 110000.0 | 19709000.0 | 8801.0 | 114192.0 | 21603000.0 | 16810.0 | 118769.0 | 23929700.0 | 17491.0 | 127794.0 | 21976400.0 | 19683.0 | 134636.0 | 22580100.0 |
| AGO | Angola | 153571.0 | 284925.0 | 13884500.0 | 123135.0 | 304493.0 | 13001100.0 | 96919.0 | 161070.0 | 13324000.0 | 109561.0 | 171858.0 | 13288400.0 | 119047.0 | 124726.0 | 14830200.0 |
| ALB | Albania | 242870.0 | 226718.0 | 3400960.0 | 266379.0 | 248640.0 | 3369760.0 | 303612.0 | 246720.0 | 3625700.0 | 360761.0 | 247428.0 | 2714880.0 | 436192.0 | 242859.0 | 2630080.0 |
| AND | Andorra | 30694.0 | 38850.0 | 71336.0 | 32490.0 | 38694.0 | 76132.0 | 34284.0 | 38411.0 | 80337.0 | 35663.0 | 39375.0 | 82614.0 | 36358.0 | 39657.0 | 87909.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| XKX | Kosovo | NaN | 106300.0 | 562000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 106300.0 | 562000.0 |
| YEM | Yemen, Rep. | 395000.0 | 1142230.0 | 15022000.0 | 430400.0 | 1165830.0 | 16433100.0 | 469000.0 | 1190000.0 | 15357900.0 | 469000.0 | 1190000.0 | 15357900.0 | 469000.0 | 1190000.0 | 15357900.0 |
| ZAF | South Africa | 1409350.0 | 4131060.0 | 87999500.0 | 1150770.0 | 4522850.0 | 82412900.0 | 1123190.0 | 4810070.0 | 88497600.0 | 1386840.0 | 3103750.0 | 88567000.0 | 1250360.0 | 2024730.0 | 96972500.0 |
| ZMB | Zambia | 23390.0 | 116165.0 | 11557700.0 | 31784.0 | 101407.0 | 12017000.0 | 35912.0 | 101444.0 | 13438500.0 | 43365.0 | 100444.0 | 15470300.0 | 88891.0 | 96719.0 | 17220600.0 |
| ZWE | Zimbabwe | 163987.0 | 333702.0 | 12757400.0 | 170838.0 | 305720.0 | 12878900.0 | 187310.0 | 264150.0 | 14092100.0 | 203056.0 | 268849.0 | 12909000.0 | 204424.0 | 265734.0 | 13195900.0 |
217 rows × 15 columns
This is looking good, but you'll notice that we still have a multi-level index. There are many ways to resolve this, but the one I'd like to use here is to give the dataframe a brand-new index of consecutive numbers, like a default pandas dataframe. This has the added advantage of letting us use Country Code and Country Name as regular pandas columns.
Resetting the index of a dataframe with .reset_index() adds a new 0..1..2..N index to the dataframe, and releases any column(s) that may have been previously used as the index, turning them into regular columns.
wbdf.reset_index(inplace=True)
wbdf
| Country Code | Country Name | Fixed_broadband_subscriptions_2015 | Fixed_telephone_subscriptions_2015 | Mobile_cellular_subscriptions_2015 | Fixed_broadband_subscriptions_2016 | Fixed_telephone_subscriptions_2016 | Mobile_cellular_subscriptions_2016 | Fixed_broadband_subscriptions_2017 | Fixed_telephone_subscriptions_2017 | Mobile_cellular_subscriptions_2017 | Fixed_broadband_subscriptions_2018 | Fixed_telephone_subscriptions_2018 | Mobile_cellular_subscriptions_2018 | Fixed_broadband_subscriptions_2019 | Fixed_telephone_subscriptions_2019 | Mobile_cellular_subscriptions_2019 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABW | Aruba | 19000.0 | 35000.0 | 141000.0 | 19000.0 | 35000.0 | 141000.0 | 19000.0 | 35000.0 | 141000.0 | 19000.0 | 35000.0 | 141000.0 | 19000.0 | 35000.0 | 141000.0 |
| 1 | AFG | Afghanistan | 7067.0 | 110000.0 | 19709000.0 | 8801.0 | 114192.0 | 21603000.0 | 16810.0 | 118769.0 | 23929700.0 | 17491.0 | 127794.0 | 21976400.0 | 19683.0 | 134636.0 | 22580100.0 |
| 2 | AGO | Angola | 153571.0 | 284925.0 | 13884500.0 | 123135.0 | 304493.0 | 13001100.0 | 96919.0 | 161070.0 | 13324000.0 | 109561.0 | 171858.0 | 13288400.0 | 119047.0 | 124726.0 | 14830200.0 |
| 3 | ALB | Albania | 242870.0 | 226718.0 | 3400960.0 | 266379.0 | 248640.0 | 3369760.0 | 303612.0 | 246720.0 | 3625700.0 | 360761.0 | 247428.0 | 2714880.0 | 436192.0 | 242859.0 | 2630080.0 |
| 4 | AND | Andorra | 30694.0 | 38850.0 | 71336.0 | 32490.0 | 38694.0 | 76132.0 | 34284.0 | 38411.0 | 80337.0 | 35663.0 | 39375.0 | 82614.0 | 36358.0 | 39657.0 | 87909.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 212 | XKX | Kosovo | NaN | 106300.0 | 562000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 106300.0 | 562000.0 |
| 213 | YEM | Yemen, Rep. | 395000.0 | 1142230.0 | 15022000.0 | 430400.0 | 1165830.0 | 16433100.0 | 469000.0 | 1190000.0 | 15357900.0 | 469000.0 | 1190000.0 | 15357900.0 | 469000.0 | 1190000.0 | 15357900.0 |
| 214 | ZAF | South Africa | 1409350.0 | 4131060.0 | 87999500.0 | 1150770.0 | 4522850.0 | 82412900.0 | 1123190.0 | 4810070.0 | 88497600.0 | 1386840.0 | 3103750.0 | 88567000.0 | 1250360.0 | 2024730.0 | 96972500.0 |
| 215 | ZMB | Zambia | 23390.0 | 116165.0 | 11557700.0 | 31784.0 | 101407.0 | 12017000.0 | 35912.0 | 101444.0 | 13438500.0 | 43365.0 | 100444.0 | 15470300.0 | 88891.0 | 96719.0 | 17220600.0 |
| 216 | ZWE | Zimbabwe | 163987.0 | 333702.0 | 12757400.0 | 170838.0 | 305720.0 | 12878900.0 | 187310.0 | 264150.0 | 14092100.0 | 203056.0 | 268849.0 | 12909000.0 | 204424.0 | 265734.0 | 13195900.0 |
217 rows × 17 columns
Nice. You will see that now our indices are flat and our dataframe looks just like any regular pandas dataframe.
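As an aside, if you ever want to throw the old index away instead of keeping it as columns, `.reset_index()` accepts `drop=True`. A minimal sketch with a toy frame (the column names here are just illustrative):

```python
import pandas as pd

# A tiny frame indexed by country code, mimicking wbdf's structure
df = pd.DataFrame({'pop': [104872, 35383128]},
                  index=pd.Index(['ABW', 'AFG'], name='Country Code'))

kept = df.reset_index()                 # 'Country Code' becomes a regular column
discarded = df.reset_index(drop=True)   # the old index is simply thrown away
```

Here we want to keep the country codes and names, so the default (no `drop`) is the right choice.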
Even better, we can now access the Country Code and Country Name as regular columns:
wbdf['Country Name']
0 Aruba
1 Afghanistan
2 Angola
3 Albania
4 Andorra
...
212 Kosovo
213 Yemen, Rep.
214 South Africa
215 Zambia
216 Zimbabwe
Name: Country Name, Length: 217, dtype: object
Or we can also get multiple columns:
wbdf[['Country Code', 'Country Name', 'Mobile_cellular_subscriptions_2019']]
| Country Code | Country Name | Mobile_cellular_subscriptions_2019 | |
|---|---|---|---|
| 0 | ABW | Aruba | 141000.0 |
| 1 | AFG | Afghanistan | 22580100.0 |
| 2 | AGO | Angola | 14830200.0 |
| 3 | ALB | Albania | 2630080.0 |
| 4 | AND | Andorra | 87909.0 |
| ... | ... | ... | ... |
| 212 | XKX | Kosovo | 562000.0 |
| 213 | YEM | Yemen, Rep. | 15357900.0 |
| 214 | ZAF | South Africa | 96972500.0 |
| 215 | ZMB | Zambia | 17220600.0 |
| 216 | ZWE | Zimbabwe | 13195900.0 |
217 rows × 3 columns
You will have noticed that pandas helpfully truncates long output rather than printing every row to screen. But what if you do want them all? You can either print the desired column as a list with list(df.mycol), or, to stay within pandas, you can ask for the .values attribute:
wbdf['Country Code'].values
array(['ABW', 'AFG', 'AGO', 'ALB', 'AND', 'ARE', 'ARG', 'ARM', 'ASM',
'ATG', 'AUS', 'AUT', 'AZE', 'BDI', 'BEL', 'BEN', 'BFA', 'BGD',
'BGR', 'BHR', 'BHS', 'BIH', 'BLR', 'BLZ', 'BMU', 'BOL', 'BRA',
'BRB', 'BRN', 'BTN', 'BWA', 'CAF', 'CAN', 'CHE', 'CHI', 'CHL',
'CHN', 'CIV', 'CMR', 'COD', 'COG', 'COL', 'COM', 'CPV', 'CRI',
'CUB', 'CUW', 'CYM', 'CYP', 'CZE', 'DEU', 'DJI', 'DMA', 'DNK',
'DOM', 'DZA', 'ECU', 'EGY', 'ERI', 'ESP', 'EST', 'ETH', 'FIN',
'FJI', 'FRA', 'FRO', 'FSM', 'GAB', 'GBR', 'GEO', 'GHA', 'GIB',
'GIN', 'GMB', 'GNB', 'GNQ', 'GRC', 'GRD', 'GRL', 'GTM', 'GUM',
'GUY', 'HKG', 'HND', 'HRV', 'HTI', 'HUN', 'IDN', 'IMN', 'IND',
'IRL', 'IRN', 'IRQ', 'ISL', 'ISR', 'ITA', 'JAM', 'JOR', 'JPN',
'KAZ', 'KEN', 'KGZ', 'KHM', 'KIR', 'KNA', 'KOR', 'KWT', 'LAO',
'LBN', 'LBR', 'LBY', 'LCA', 'LIE', 'LKA', 'LSO', 'LTU', 'LUX',
'LVA', 'MAC', 'MAF', 'MAR', 'MCO', 'MDA', 'MDG', 'MDV', 'MEX',
'MHL', 'MKD', 'MLI', 'MLT', 'MMR', 'MNE', 'MNG', 'MNP', 'MOZ',
'MRT', 'MUS', 'MWI', 'MYS', 'NAM', 'NCL', 'NER', 'NGA', 'NIC',
'NLD', 'NOR', 'NPL', 'NRU', 'NZL', 'OMN', 'PAK', 'PAN', 'PER',
'PHL', 'PLW', 'PNG', 'POL', 'PRI', 'PRK', 'PRT', 'PRY', 'PSE',
'PYF', 'QAT', 'ROU', 'RUS', 'RWA', 'SAU', 'SDN', 'SEN', 'SGP',
'SLB', 'SLE', 'SLV', 'SMR', 'SOM', 'SRB', 'SSD', 'STP', 'SUR',
'SVK', 'SVN', 'SWE', 'SWZ', 'SXM', 'SYC', 'SYR', 'TCA', 'TCD',
'TGO', 'THA', 'TJK', 'TKM', 'TLS', 'TON', 'TTO', 'TUN', 'TUR',
'TUV', 'TZA', 'UGA', 'UKR', 'URY', 'USA', 'UZB', 'VCT', 'VEN',
'VGB', 'VIR', 'VNM', 'VUT', 'WSM', 'XKX', 'YEM', 'ZAF', 'ZMB',
'ZWE'], dtype=object)
⚠️ Extracting the .values is also helpful for columns with long text, which pandas truncates by default. ⚠️
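Another way around the truncation, if you'd rather keep the usual dataframe display, is to widen pandas' display options temporarily with pd.option_context. A minimal sketch on a toy frame:

```python
import pandas as pd

# A toy frame with a cell far wider than the default display limit
df = pd.DataFrame({'text': ['a very long piece of text, ' * 10]})

# Inside the with-block, cells are shown in full; the default is restored on exit
with pd.option_context('display.max_colwidth', None):
    full = repr(df)

truncated = repr(df)  # back to the default truncated display
```

The change only applies inside the with-block, so you don't pollute the rest of your notebook's output.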
Summary statistics¶Let's now look at some summary statistics, starting with .describe():
wbdf.describe()
| Fixed_broadband_subscriptions_2015 | Fixed_telephone_subscriptions_2015 | Mobile_cellular_subscriptions_2015 | Fixed_broadband_subscriptions_2016 | Fixed_telephone_subscriptions_2016 | Mobile_cellular_subscriptions_2016 | Fixed_broadband_subscriptions_2017 | Fixed_telephone_subscriptions_2017 | Mobile_cellular_subscriptions_2017 | Fixed_broadband_subscriptions_2018 | Fixed_telephone_subscriptions_2018 | Mobile_cellular_subscriptions_2018 | Fixed_broadband_subscriptions_2019 | Fixed_telephone_subscriptions_2019 | Mobile_cellular_subscriptions_2019 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2.070000e+02 | 2.130000e+02 | 2.090000e+02 | 2.050000e+02 | 2.120000e+02 | 2.080000e+02 | 2.060000e+02 | 2.120000e+02 | 2.080000e+02 | 2.060000e+02 | 2.120000e+02 | 2.070000e+02 | 2.070000e+02 | 2.130000e+02 | 2.090000e+02 |
| mean | 4.036865e+06 | 4.846586e+06 | 3.422394e+07 | 4.444971e+06 | 4.697239e+06 | 3.596521e+07 | 4.969771e+06 | 4.563842e+06 | 3.712421e+07 | 5.188242e+06 | 4.480567e+06 | 3.818165e+07 | 5.449234e+06 | 4.308571e+06 | 3.907369e+07 |
| std | 2.104877e+07 | 1.954467e+07 | 1.210942e+08 | 2.418216e+07 | 1.813363e+07 | 1.308477e+08 | 2.892290e+07 | 1.733523e+07 | 1.383219e+08 | 2.993288e+07 | 1.662554e+07 | 1.466957e+08 | 3.260127e+07 | 1.663955e+07 | 1.502775e+08 |
| min | 1.170000e+02 | 0.000000e+00 | 6.600000e+03 | 6.600000e+01 | 0.000000e+00 | 7.600000e+03 | 7.600000e+01 | 0.000000e+00 | 8.000000e+03 | 8.400000e+01 | 0.000000e+00 | 8.000000e+03 | 6.800000e+01 | 0.000000e+00 | 8.000000e+03 |
| 25% | 1.603600e+04 | 4.300000e+04 | 1.007890e+06 | 1.797300e+04 | 3.851275e+04 | 1.111508e+06 | 1.900150e+04 | 3.583675e+04 | 1.144542e+06 | 1.882750e+04 | 3.748200e+04 | 1.245880e+06 | 2.007950e+04 | 3.685500e+04 | 1.151000e+06 |
| 50% | 1.699390e+05 | 3.448530e+05 | 7.079250e+06 | 2.126000e+05 | 3.611100e+05 | 7.084140e+06 | 2.387055e+05 | 3.226345e+05 | 7.125965e+06 | 2.322460e+05 | 3.222170e+05 | 7.197000e+06 | 2.690910e+05 | 3.027540e+05 | 7.119000e+06 |
| 75% | 1.758845e+06 | 2.222370e+06 | 2.085050e+07 | 1.712000e+06 | 2.327262e+06 | 2.191188e+07 | 2.081332e+06 | 2.426678e+06 | 2.255328e+07 | 2.143475e+06 | 2.368532e+06 | 2.196600e+07 | 2.149490e+06 | 2.024730e+06 | 2.195560e+07 |
| max | 2.770460e+08 | 2.309960e+08 | 1.291980e+09 | 3.225970e+08 | 2.066240e+08 | 1.364930e+09 | 3.941900e+08 | 1.937570e+08 | 1.469880e+09 | 4.073820e+08 | 1.822480e+08 | 1.641150e+09 | 4.492790e+08 | 1.910330e+08 | 1.725700e+09 |
Some further useful information can be found with .info(), most notably the number of non-null entries (also shown as "count" above) and the data type of each column (which we could instead get with wbdf.dtypes):
wbdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217 entries, 0 to 216
Data columns (total 17 columns):
 #   Column                              Non-Null Count  Dtype
---  ------                              --------------  -----
 0   Country Code                        217 non-null    object
 1   Country Name                        217 non-null    object
 2   Fixed_broadband_subscriptions_2015  207 non-null    float64
 3   Fixed_telephone_subscriptions_2015  213 non-null    float64
 4   Mobile_cellular_subscriptions_2015  209 non-null    float64
 5   Fixed_broadband_subscriptions_2016  205 non-null    float64
 6   Fixed_telephone_subscriptions_2016  212 non-null    float64
 7   Mobile_cellular_subscriptions_2016  208 non-null    float64
 8   Fixed_broadband_subscriptions_2017  206 non-null    float64
 9   Fixed_telephone_subscriptions_2017  212 non-null    float64
 10  Mobile_cellular_subscriptions_2017  208 non-null    float64
 11  Fixed_broadband_subscriptions_2018  206 non-null    float64
 12  Fixed_telephone_subscriptions_2018  212 non-null    float64
 13  Mobile_cellular_subscriptions_2018  207 non-null    float64
 14  Fixed_broadband_subscriptions_2019  207 non-null    float64
 15  Fixed_telephone_subscriptions_2019  213 non-null    float64
 16  Mobile_cellular_subscriptions_2019  209 non-null    float64
dtypes: float64(15), object(2)
memory usage: 28.9+ KB
The .describe() method provides a general view of several useful summary statistics, such as the mean, count, median (i.e., the 50% quantile), min, max, and standard deviation (std). All of these have their own methods too, which we can apply either on the entire dataframe or on specific columns.
If we run one of these summary methods on the entire dataframe, pandas calculates it for all possible columns, as shown below.
wbdf.mean()
/var/folders/z8/g_4vn19s72gbcbxyx5s3vg5c0000gn/T/ipykernel_85707/588518059.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
  wbdf.mean()
Fixed_broadband_subscriptions_2015    4.036865e+06
Fixed_telephone_subscriptions_2015    4.846586e+06
Mobile_cellular_subscriptions_2015    3.422394e+07
Fixed_broadband_subscriptions_2016    4.444971e+06
Fixed_telephone_subscriptions_2016    4.697239e+06
Mobile_cellular_subscriptions_2016    3.596521e+07
Fixed_broadband_subscriptions_2017    4.969771e+06
Fixed_telephone_subscriptions_2017    4.563842e+06
Mobile_cellular_subscriptions_2017    3.712421e+07
Fixed_broadband_subscriptions_2018    5.188242e+06
Fixed_telephone_subscriptions_2018    4.480567e+06
Mobile_cellular_subscriptions_2018    3.818165e+07
Fixed_broadband_subscriptions_2019    5.449234e+06
Fixed_telephone_subscriptions_2019    4.308571e+06
Mobile_cellular_subscriptions_2019    3.907369e+07
dtype: float64
(You will probably see a warning telling you to drop the invalid columns before computing the mean, because future versions of pandas won't filter them out by default. The problem is that we have a couple of string (object) columns holding the country names and codes, and of course it's not possible to calculate their mean or std. For now, pandas deals with it, but future versions will throw an error. As homework, go ahead and change the code so that the text columns are filtered out and the warning disappears.)
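If you want a hint for that homework, here is the general pattern on a toy frame (not the wbdf data itself; exact keyword support depends on your pandas version):

```python
import pandas as pd

df = pd.DataFrame({'code': ['ABW', 'AFG'],                  # text: no mean possible
                   'subs_2019': [141000.0, 22580100.0]})    # numeric: fine

# Either select the numeric columns explicitly before reducing...
means = df.select_dtypes('number').mean()

# ...or ask the reduction itself to skip non-numeric columns
means_alt = df.mean(numeric_only=True)
```

Both approaches produce the same Series of means, with only the numeric columns included.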
More typically, instead of calculating a summary statistic for the entire dataframe, we will want to do so on a specific column:
wbdf['Fixed_broadband_subscriptions_2015'].std()
21048770.1828913
While .min() and .max() tell us what the minimum and maximum values are, they don't tell us which rows (i.e. countries in this case) they belong to. The .idxmax() and .idxmin() methods give the index of the rows that contain the maximum and minimum values for a given column:
# Define a variable to hold the desired column name
desired_col = 'Mobile_cellular_subscriptions_2019'
# Determine the rows that contain the min and max values for the desired col
minrow = wbdf[desired_col].idxmin()
maxrow = wbdf[desired_col].idxmax()
# Now, print the rows to screen
print(f"The row with LEAST {desired_col} is: {minrow}")
print(f"The row with MOST {desired_col} is: {maxrow}")
The row with LEAST Mobile_cellular_subscriptions_2019 is: 198
The row with MOST Mobile_cellular_subscriptions_2019 is: 36
On their own, these indexes are not very informative, but we can use them to extract the names of the corresponding countries using .iloc:
print(f"The country with LEAST {desired_col} was: {wbdf.iloc[minrow]['Country Name']}")
print(f"The country with MOST {desired_col} was: {wbdf.iloc[maxrow]['Country Name']}")
The country with LEAST Mobile_cellular_subscriptions_2019 was: Tuvalu
The country with MOST Mobile_cellular_subscriptions_2019 was: China
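Note that .idxmin() and .idxmax() return index *labels*, so .loc works just as well as .iloc here; the two coincide only because our reset index is the default 0..N. A sketch on a toy frame:

```python
import pandas as pd

# Toy stand-in for wbdf: a default integer index, as after reset_index()
df = pd.DataFrame({'country': ['Tuvalu', 'China'],
                   'subs': [8000.0, 1725700000.0]})

max_label = df['subs'].idxmax()         # an index label, not a position
winner = df.loc[max_label, 'country']   # label-based lookup with .loc
```

If the dataframe had a non-numeric index (say, the country codes), .iloc with the result of .idxmax() would fail, but .loc would still work.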
Let's load another very interesting set of country indicator data.
wbdf2 = pd.read_csv('Data/WB_World_Demography_Progress_2016_unstacked.csv')
wbdf2
| Country | Region | Population | Income Group | CO2 emissions (metric tons per capita) | % Literacy, adult female | % Literacy, adult male | % Primary completion, female | % Primary completion, male | % Lower secondary completion, female | % Lower secondary completion, male | % Parliament seats held by women | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | South Asia | 35383128.0 | Low income | 0.245101 | NaN | NaN | 63.604069 | 96.542091 | 36.819771 | 63.874481 | 27.710843 |
| 1 | Albania | Europe & Central Asia | 2876101.0 | Upper middle income | 1.577163 | NaN | NaN | 102.297653 | 97.790390 | NaN | NaN | 22.857143 |
| 2 | Algeria | Middle East & North Africa | 40551404.0 | Lower middle income | 3.699156 | NaN | NaN | 105.560043 | 105.618858 | 87.665314 | 70.962387 | 31.601732 |
| 3 | American Samoa | East Asia & Pacific | 55741.0 | Upper middle income | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Andorra | Europe & Central Asia | 77297.0 | High income | 6.072370 | NaN | NaN | NaN | NaN | NaN | NaN | 32.142857 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 212 | Virgin Islands (U.S.) | Latin America & Caribbean | 107510.0 | High income | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 213 | West Bank and Gaza | Middle East & North Africa | 4367088.0 | Lower middle income | 0.740607 | 95.207909 | 98.607841 | 97.022377 | 97.090050 | 88.554771 | 73.739983 | NaN |
| 214 | Yemen, Rep. | Middle East & North Africa | 27168210.0 | Low income | 0.390480 | NaN | NaN | 64.500092 | 79.805077 | 45.206539 | 60.714600 | 0.000000 |
| 215 | Zambia | Sub-Saharan Africa | 16363507.0 | Lower middle income | 0.314183 | NaN | NaN | NaN | NaN | NaN | NaN | 17.964072 |
| 216 | Zimbabwe | Sub-Saharan Africa | 14030390.0 | Lower middle income | 0.782777 | NaN | NaN | NaN | NaN | NaN | NaN | 31.481481 |
217 rows × 12 columns
(Note that I already "unstacked" (i.e. flattened) this one)
Two things worth noting right from the start: first, wbdf2 has no Country Code column, only the country names; and second, some of those names are spelled differently from the ones in wbdf.
We'll bump into both of these problems below as we attempt to merge the dataframes.
For now, let's look at a high-level summary/description of the contents of the dataframe:
wbdf2.describe()
| Population | CO2 emissions (metric tons per capita) | % Literacy, adult female | % Literacy, adult male | % Primary completion, female | % Primary completion, male | % Lower secondary completion, female | % Lower secondary completion, male | % Parliament seats held by women | |
|---|---|---|---|---|---|---|---|---|---|
| count | 2.160000e+02 | 217.000000 | 36.000000 | 36.000000 | 133.000000 | 133.000000 | 123.000000 | 123.000000 | 192.000000 |
| mean | 3.424723e+07 | 4.463764 | 88.451383 | 92.211031 | 92.052718 | 92.489008 | 81.361217 | 79.591794 | 20.888705 |
| std | 1.347982e+08 | 5.942907 | 16.220210 | 12.695188 | 16.444161 | 14.688918 | 25.905601 | 24.162610 | 12.018829 |
| min | 1.122500e+04 | 0.000000 | 13.955230 | 31.328791 | 32.834202 | 48.877022 | 7.635840 | 11.978990 | 0.000000 |
| 25% | 7.627018e+05 | 0.595648 | 90.441835 | 92.019470 | 89.022141 | 89.599998 | 66.612473 | 65.939972 | 12.302328 |
| 50% | 6.424154e+06 | 2.495571 | 93.675526 | 96.070908 | 97.593224 | 96.522697 | 90.642464 | 87.019386 | 19.873534 |
| 75% | 2.384867e+07 | 5.948891 | 95.889345 | 98.565899 | 101.701843 | 100.746208 | 98.956196 | 97.036179 | 29.580868 |
| max | 1.378665e+09 | 38.901471 | 99.978409 | 99.994873 | 133.333328 | 128.703705 | 116.886810 | 116.486862 | 63.750000 |
This is very useful. For example, the count row tells us that only 36 countries have an entry for the two % Literacy columns, so these two may not be particularly useful and we may do better to just drop them:
wbdf2.drop(columns=['% Literacy, adult female', '% Literacy, adult male'], inplace=True)
wbdf2.describe()
| Population | CO2 emissions (metric tons per capita) | % Primary completion, female | % Primary completion, male | % Lower secondary completion, female | % Lower secondary completion, male | % Parliament seats held by women | |
|---|---|---|---|---|---|---|---|
| count | 2.160000e+02 | 217.000000 | 133.000000 | 133.000000 | 123.000000 | 123.000000 | 192.000000 |
| mean | 3.424723e+07 | 4.463764 | 92.052718 | 92.489008 | 81.361217 | 79.591794 | 20.888705 |
| std | 1.347982e+08 | 5.942907 | 16.444161 | 14.688918 | 25.905601 | 24.162610 | 12.018829 |
| min | 1.122500e+04 | 0.000000 | 32.834202 | 48.877022 | 7.635840 | 11.978990 | 0.000000 |
| 25% | 7.627018e+05 | 0.595648 | 89.022141 | 89.599998 | 66.612473 | 65.939972 | 12.302328 |
| 50% | 6.424154e+06 | 2.495571 | 97.593224 | 96.522697 | 90.642464 | 87.019386 | 19.873534 |
| 75% | 2.384867e+07 | 5.948891 | 101.701843 | 100.746208 | 98.956196 | 97.036179 | 29.580868 |
| max | 1.378665e+09 | 38.901471 | 133.333328 | 128.703705 | 116.886810 | 116.486862 | 63.750000 |
We now want to build a Frankensteined dataframe with both sets of data (the one in wbdf with the mobile data and so on, and the one in wbdf2 with the more social stuff, plus CO2 emissions). Pandas provides two very handy methods for this purpose: join and merge. With merge, we can specify which column we want to use to merge the two dataframes. Unfortunately, wbdf2 doesn't have a Country Code column, so we will have to merge on the much more fragile Country Name. This will likely be problematic, but let's try:
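Before committing to a merge on a fragile key like country names, it can pay to dry-run it with how='outer' and indicator=True (both documented merge parameters), which flag the rows that fail to match. A sketch with made-up toy data:

```python
import pandas as pd

left = pd.DataFrame({'Country Name': ['Aruba', 'Curacao'], 'subs': [1, 2]})
right = pd.DataFrame({'Country': ['Aruba', 'Curaçao'], 'pop': [3, 4]})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
check = left.merge(right, left_on='Country Name', right_on='Country',
                   how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']
```

Here the accented 'Curaçao' fails to match its unaccented twin, exactly the kind of problem we're about to run into below.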
wbmerged = wbdf.merge(wbdf2, left_on='Country Name', right_on='Country')
wbmerged
| Country Code | Country Name | Fixed_broadband_subscriptions_2015 | Fixed_telephone_subscriptions_2015 | Mobile_cellular_subscriptions_2015 | Fixed_broadband_subscriptions_2016 | Fixed_telephone_subscriptions_2016 | Mobile_cellular_subscriptions_2016 | Fixed_broadband_subscriptions_2017 | Fixed_telephone_subscriptions_2017 | ... | Country | Region | Population | Income Group | CO2 emissions (metric tons per capita) | % Primary completion, female | % Primary completion, male | % Lower secondary completion, female | % Lower secondary completion, male | % Parliament seats held by women | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABW | Aruba | 19000.0 | 35000.0 | 141000.0 | 19000.0 | 35000.0 | 141000.0 | 19000.0 | 35000.0 | ... | Aruba | Latin America & Caribbean | 104872.0 | High income | 8.426911 | NaN | NaN | NaN | NaN | NaN |
| 1 | AFG | Afghanistan | 7067.0 | 110000.0 | 19709000.0 | 8801.0 | 114192.0 | 21603000.0 | 16810.0 | 118769.0 | ... | Afghanistan | South Asia | 35383128.0 | Low income | 0.245101 | 63.604069 | 96.542091 | 36.819771 | 63.874481 | 27.710843 |
| 2 | AGO | Angola | 153571.0 | 284925.0 | 13884500.0 | 123135.0 | 304493.0 | 13001100.0 | 96919.0 | 161070.0 | ... | Angola | Sub-Saharan Africa | 28842484.0 | Lower middle income | 1.202861 | NaN | NaN | NaN | NaN | 36.818182 |
| 3 | ALB | Albania | 242870.0 | 226718.0 | 3400960.0 | 266379.0 | 248640.0 | 3369760.0 | 303612.0 | 246720.0 | ... | Albania | Europe & Central Asia | 2876101.0 | Upper middle income | 1.577163 | 102.297653 | 97.790390 | NaN | NaN | 22.857143 |
| 4 | AND | Andorra | 30694.0 | 38850.0 | 71336.0 | 32490.0 | 38694.0 | 76132.0 | 34284.0 | 38411.0 | ... | Andorra | Europe & Central Asia | 77297.0 | High income | 6.072370 | NaN | NaN | NaN | NaN | 32.142857 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 207 | XKX | Kosovo | NaN | 106300.0 | 562000.0 | NaN | NaN | NaN | NaN | NaN | ... | Kosovo | Europe & Central Asia | 1777557.0 | Upper middle income | 5.031520 | NaN | NaN | NaN | NaN | NaN |
| 208 | YEM | Yemen, Rep. | 395000.0 | 1142230.0 | 15022000.0 | 430400.0 | 1165830.0 | 16433100.0 | 469000.0 | 1190000.0 | ... | Yemen, Rep. | Middle East & North Africa | 27168210.0 | Low income | 0.390480 | 64.500092 | 79.805077 | 45.206539 | 60.714600 | 0.000000 |
| 209 | ZAF | South Africa | 1409350.0 | 4131060.0 | 87999500.0 | 1150770.0 | 4522850.0 | 82412900.0 | 1123190.0 | 4810070.0 | ... | South Africa | Sub-Saharan Africa | 56203654.0 | Upper middle income | 8.480658 | 88.068489 | 86.454247 | 83.567780 | 78.010681 | 42.000000 |
| 210 | ZMB | Zambia | 23390.0 | 116165.0 | 11557700.0 | 31784.0 | 101407.0 | 12017000.0 | 35912.0 | 101444.0 | ... | Zambia | Sub-Saharan Africa | 16363507.0 | Lower middle income | 0.314183 | NaN | NaN | NaN | NaN | 17.964072 |
| 211 | ZWE | Zimbabwe | 163987.0 | 333702.0 | 12757400.0 | 170838.0 | 305720.0 | 12878900.0 | 187310.0 | 264150.0 | ... | Zimbabwe | Sub-Saharan Africa | 14030390.0 | Lower middle income | 0.782777 | NaN | NaN | NaN | NaN | 31.481481 |
212 rows × 27 columns
Looks good, the columns are now all together and they seem to be joined correctly. But take a look at the number of rows in the three dataframes (which we do simply by getting their lengths):
print(f"The first dataframe is {len(wbdf)} rows long")
print(f"And the second one is {len(wbdf2)} rows long")
print(f"Yet, the merged one is {len(wbmerged)} rows long")
The first dataframe is 217 rows long
And the second one is 217 rows long
Yet, the merged one is 212 rows long
It appears we've lost 5 countries! Since we used the country names to merge, there must have been a problem there. Let's compare the two columns.
There are many ways to compare columns (e.g. you could use a for loop and if statements in regular python), but here's a fancy pandas way to find the countries in wbdf2 that are not (~) in the column Country Name of the first wbdf:
wbdf2['Country'][~wbdf2['Country'].isin(wbdf['Country Name'])]
49                       Curaçao
52                 Côte d'Ivoire
63                     Swaziland
103    Korea, Dem. People's Rep.
189        São Tomé and Principe
Name: Country, dtype: object
(I do not expect you to fully understand what that code is doing. For now, just do what most people do: copy and paste it from StackOverflow and then try to understand it if you want to and have the time, otherwise just make sure that it does what you need and move on. Briefly, it would read something like: "get the column Country from wbdf2, and filter it by those elements in the same column that are not in the Country Name column in wbdf.")
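The same pattern may be easier to digest on a toy pair of Series:

```python
import pandas as pd

a = pd.Series(['x', 'y', 'z'])
b = pd.Series(['x', 'z'])

mask = a.isin(b)        # True where an element of a also appears in b
only_in_a = a[~mask]    # ~ negates the Boolean mask, keeping the misfits
```

That is all the fancy one-liner does: build a Boolean mask with .isin(), flip it with ~, and use it to filter the column.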
Let's do it the other way around, to find the countries that are in the first but not the second:
wbdf['Country Name'][~wbdf['Country Name'].isin(wbdf2['Country'])]
37                 Cote d'Ivoire
46                       Curacao
158    Korea, Dem. People’s Rep.
178        Sao Tome and Principe
183                     Eswatini
Name: Country Name, dtype: object
These are the same five countries, but with mostly extremely subtle differences in spelling. There's also Eswatini, which the second dataframe still refers to by its now-disfavoured name, Swaziland.
The trickiest problem is perhaps with the DPRK (i.e., North Korea), which has a ’ in one of the dataframes and a ' in the other... spot the difference? Many thousands of programmer hours are wasted every day because of these and their other quadruplet sisters, the backquote/backtick ` and the accent ´.
.replace()¶Above we used a Boolean filter to change all the values in the dataframe that matched the filter. However, here I cannot think of any clever way of solving the problem of the five differing names programmatically. Instead, we will simply have to change each of them one by one. Pandas provides the method .replace() for this.
It doesn't matter which of the two dataframes we do it in, or we could do it in both, as long as we end up with the same name in both.
Let's change the five differing values in the second dataframe only. We could do it one by one (.replace(oldname, newname)), but we can also use a dictionary:
country_corrections = {
# 'old_name' : 'new_name'
"Côte d'Ivoire" : "Cote d'Ivoire", # had to use double quotes because of the ' in d'Ivoire
'Curaçao' : 'Curacao',
'Korea, Dem. People\'s Rep.' : 'Korea, Dem. People’s Rep.', # Here I escaped the ' with \' instead of using double quotes
'São Tomé and Principe' : 'Sao Tome and Principe',
'Swaziland' : 'Eswatini',
}
wbdf2.replace(country_corrections, inplace=True)
We use inplace=True so that the change is made in the dataframe itself, rather than returning a modified copy.
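For reference, the equivalent without inplace=True is plain reassignment; a sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Swaziland', 'Aruba']})

# Without inplace=True, .replace() returns a modified copy,
# so we reassign it back to the same name
df = df.replace({'Swaziland': 'Eswatini'})
```

Many pandas users prefer this reassignment style, since inplace=True is discouraged in recent pandas versions.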
Let's look for differences again:
wbdf['Country Name'][~wbdf['Country Name'].isin(wbdf2['Country'])]
Series([], Name: Country Name, dtype: object)
Now the result is empty (no country names in wbdf are missing from wbdf2), so we can go ahead and create the merged dataframe again:
wbmerged = wbdf.merge(wbdf2, left_on='Country Name', right_on='Country')
wbmerged
| Country Code | Country Name | Fixed_broadband_subscriptions_2015 | Fixed_telephone_subscriptions_2015 | Mobile_cellular_subscriptions_2015 | Fixed_broadband_subscriptions_2016 | Fixed_telephone_subscriptions_2016 | Mobile_cellular_subscriptions_2016 | Fixed_broadband_subscriptions_2017 | Fixed_telephone_subscriptions_2017 | ... | Country | Region | Population | Income Group | CO2 emissions (metric tons per capita) | % Primary completion, female | % Primary completion, male | % Lower secondary completion, female | % Lower secondary completion, male | % Parliament seats held by women | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABW | Aruba | 19000.0 | 35000.0 | 141000.0 | 19000.0 | 35000.0 | 141000.0 | 19000.0 | 35000.0 | ... | Aruba | Latin America & Caribbean | 104872.0 | High income | 8.426911 | NaN | NaN | NaN | NaN | NaN |
| 1 | AFG | Afghanistan | 7067.0 | 110000.0 | 19709000.0 | 8801.0 | 114192.0 | 21603000.0 | 16810.0 | 118769.0 | ... | Afghanistan | South Asia | 35383128.0 | Low income | 0.245101 | 63.604069 | 96.542091 | 36.819771 | 63.874481 | 27.710843 |
| 2 | AGO | Angola | 153571.0 | 284925.0 | 13884500.0 | 123135.0 | 304493.0 | 13001100.0 | 96919.0 | 161070.0 | ... | Angola | Sub-Saharan Africa | 28842484.0 | Lower middle income | 1.202861 | NaN | NaN | NaN | NaN | 36.818182 |
| 3 | ALB | Albania | 242870.0 | 226718.0 | 3400960.0 | 266379.0 | 248640.0 | 3369760.0 | 303612.0 | 246720.0 | ... | Albania | Europe & Central Asia | 2876101.0 | Upper middle income | 1.577163 | 102.297653 | 97.790390 | NaN | NaN | 22.857143 |
| 4 | AND | Andorra | 30694.0 | 38850.0 | 71336.0 | 32490.0 | 38694.0 | 76132.0 | 34284.0 | 38411.0 | ... | Andorra | Europe & Central Asia | 77297.0 | High income | 6.072370 | NaN | NaN | NaN | NaN | 32.142857 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 212 | XKX | Kosovo | NaN | 106300.0 | 562000.0 | NaN | NaN | NaN | NaN | NaN | ... | Kosovo | Europe & Central Asia | 1777557.0 | Upper middle income | 5.031520 | NaN | NaN | NaN | NaN | NaN |
| 213 | YEM | Yemen, Rep. | 395000.0 | 1142230.0 | 15022000.0 | 430400.0 | 1165830.0 | 16433100.0 | 469000.0 | 1190000.0 | ... | Yemen, Rep. | Middle East & North Africa | 27168210.0 | Low income | 0.390480 | 64.500092 | 79.805077 | 45.206539 | 60.714600 | 0.000000 |
| 214 | ZAF | South Africa | 1409350.0 | 4131060.0 | 87999500.0 | 1150770.0 | 4522850.0 | 82412900.0 | 1123190.0 | 4810070.0 | ... | South Africa | Sub-Saharan Africa | 56203654.0 | Upper middle income | 8.480658 | 88.068489 | 86.454247 | 83.567780 | 78.010681 | 42.000000 |
| 215 | ZMB | Zambia | 23390.0 | 116165.0 | 11557700.0 | 31784.0 | 101407.0 | 12017000.0 | 35912.0 | 101444.0 | ... | Zambia | Sub-Saharan Africa | 16363507.0 | Lower middle income | 0.314183 | NaN | NaN | NaN | NaN | 17.964072 |
| 216 | ZWE | Zimbabwe | 163987.0 | 333702.0 | 12757400.0 | 170838.0 | 305720.0 | 12878900.0 | 187310.0 | 264150.0 | ... | Zimbabwe | Sub-Saharan Africa | 14030390.0 | Lower middle income | 0.782777 | NaN | NaN | NaN | NaN | 31.481481 |
217 rows × 27 columns
Very nice, we now have the correct 217 countries (rows).
Computing aggregate statistics per group with .groupby()¶Now that we have a masterfully curated dataset, we can do some actual data science on it. For example, let's compare the CO2 emissions (metric tons per capita) by the income group of each country:
wbmerged.groupby('Income Group').mean(numeric_only=True)['CO2 emissions (metric tons per capita)']
Income Group High income 8.221190 Low income 0.261406 Lower middle income 1.304570 Upper middle income 3.958749 Name: CO2 emissions (metric tons per capita), dtype: float64
Gosh... if there's a hell, those of us living in high-income nations are probably headed straight there.
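Incidentally, we can also select the column of interest *before* aggregating, which computes the mean only for that column. A minimal sketch with a made-up mini-dataset (the names and values are hypothetical, just to illustrate the pattern):

```python
import pandas as pd

# Hypothetical mini-dataset: three countries in two income groups
toy = pd.DataFrame({
    'Income Group': ['High income', 'Low income', 'High income'],
    'CO2': [8.0, 0.3, 10.0],
})

# Select the column first, then aggregate: only that column gets computed
means = toy.groupby('Income Group')['CO2'].mean()
print(means['High income'])  # (8.0 + 10.0) / 2 = 9.0
print(means['Low income'])   # 0.3
```

Selecting first is a little faster on wide dataframes, and it sidesteps any non-numeric columns entirely.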
Actually, we don't have to specify a column at the end; we can get the mean of every numeric column at once:
wbmerged.groupby('Income Group').mean(numeric_only=True)
| Fixed_broadband_subscriptions_2015 | Fixed_telephone_subscriptions_2015 | Mobile_cellular_subscriptions_2015 | Fixed_broadband_subscriptions_2016 | Fixed_telephone_subscriptions_2016 | Mobile_cellular_subscriptions_2016 | Fixed_broadband_subscriptions_2017 | Fixed_telephone_subscriptions_2017 | Mobile_cellular_subscriptions_2017 | Fixed_broadband_subscriptions_2018 | ... | Fixed_broadband_subscriptions_2019 | Fixed_telephone_subscriptions_2019 | Mobile_cellular_subscriptions_2019 | Population | CO2 emissions (metric tons per capita) | % Primary completion, female | % Primary completion, male | % Lower secondary completion, female | % Lower secondary completion, male | % Parliament seats held by women | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Income Group | |||||||||||||||||||||
| High income | 4.879010e+06 | 6.246911e+06 | 1.949589e+07 | 5.114171e+06 | 6.167934e+06 | 1.983426e+07 | 5.322642e+06 | 6.053401e+06 | 1.997271e+07 | 5.484036e+06 | ... | 5.469725e+06 | 5.706193e+06 | 2.033054e+07 | 1.457980e+07 | 8.221190 | 99.551871 | 99.147545 | 98.412709 | 97.286141 | 22.533420 |
| Low income | 7.546474e+04 | 2.992354e+05 | 1.166737e+07 | 9.086304e+04 | 3.099661e+05 | 1.158824e+07 | 8.887256e+04 | 2.780780e+05 | 1.188766e+07 | 8.773011e+04 | ... | 9.397359e+04 | 2.781591e+05 | 1.264051e+07 | 2.197872e+07 | 0.261406 | 63.345006 | 68.752257 | 36.108126 | 43.148705 | 19.870146 |
| Lower middle income | 1.059873e+06 | 1.591787e+06 | 4.665254e+07 | 1.224295e+06 | 1.515669e+06 | 5.016339e+07 | 1.351712e+06 | 1.465390e+06 | 5.161023e+07 | 1.525529e+06 | ... | 1.655968e+06 | 1.404059e+06 | 5.506184e+07 | 5.581236e+07 | 1.304570 | 90.798652 | 90.303550 | 69.319957 | 66.456692 | 19.834775 |
| Upper middle income | 7.588798e+06 | 8.157082e+06 | 5.490236e+07 | 8.754493e+06 | 7.817153e+06 | 5.831417e+07 | 1.027017e+07 | 7.527917e+06 | 6.108573e+07 | 1.071907e+07 | ... | 1.161031e+07 | 7.042373e+06 | 6.403536e+07 | 4.992563e+07 | 3.958749 | 97.394501 | 97.452526 | 91.169062 | 85.479867 | 20.515271 |
4 rows × 22 columns
Besides mean(), other aggregation functions we may want to use are count(), sum(), min(), max(), std() and median(); and with .agg() we can apply any function we wish.
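For example, .agg() accepts the name of a built-in reduction or any callable, and can apply several at once. A quick sketch with a made-up dataframe (the column names and the custom "range" function are just illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'value': [1, 3, 2, 10],
})

# .agg() takes names of built-in reductions ('mean', 'max', ...) or any callable;
# keyword arguments name the output columns
stats = toy.groupby('group')['value'].agg(
    avg='mean',
    rng=lambda s: s.max() - s.min(),  # custom function: the range within each group
)
print(stats)
#        avg  rng
# group
# a      2.0    2
# b      6.0    8
```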
Let's take that latter result and sort it by the most sinful carbon producers:
wbmerged.groupby('Income Group').mean(numeric_only=True).sort_values(by='CO2 emissions (metric tons per capita)', ascending=False)
| Fixed_broadband_subscriptions_2015 | Fixed_telephone_subscriptions_2015 | Mobile_cellular_subscriptions_2015 | Fixed_broadband_subscriptions_2016 | Fixed_telephone_subscriptions_2016 | Mobile_cellular_subscriptions_2016 | Fixed_broadband_subscriptions_2017 | Fixed_telephone_subscriptions_2017 | Mobile_cellular_subscriptions_2017 | Fixed_broadband_subscriptions_2018 | ... | Fixed_broadband_subscriptions_2019 | Fixed_telephone_subscriptions_2019 | Mobile_cellular_subscriptions_2019 | Population | CO2 emissions (metric tons per capita) | % Primary completion, female | % Primary completion, male | % Lower secondary completion, female | % Lower secondary completion, male | % Parliament seats held by women | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Income Group | |||||||||||||||||||||
| High income | 4.879010e+06 | 6.246911e+06 | 1.949589e+07 | 5.114171e+06 | 6.167934e+06 | 1.983426e+07 | 5.322642e+06 | 6.053401e+06 | 1.997271e+07 | 5.484036e+06 | ... | 5.469725e+06 | 5.706193e+06 | 2.033054e+07 | 1.457980e+07 | 8.221190 | 99.551871 | 99.147545 | 98.412709 | 97.286141 | 22.533420 |
| Upper middle income | 7.588798e+06 | 8.157082e+06 | 5.490236e+07 | 8.754493e+06 | 7.817153e+06 | 5.831417e+07 | 1.027017e+07 | 7.527917e+06 | 6.108573e+07 | 1.071907e+07 | ... | 1.161031e+07 | 7.042373e+06 | 6.403536e+07 | 4.992563e+07 | 3.958749 | 97.394501 | 97.452526 | 91.169062 | 85.479867 | 20.515271 |
| Lower middle income | 1.059873e+06 | 1.591787e+06 | 4.665254e+07 | 1.224295e+06 | 1.515669e+06 | 5.016339e+07 | 1.351712e+06 | 1.465390e+06 | 5.161023e+07 | 1.525529e+06 | ... | 1.655968e+06 | 1.404059e+06 | 5.506184e+07 | 5.581236e+07 | 1.304570 | 90.798652 | 90.303550 | 69.319957 | 66.456692 | 19.834775 |
| Low income | 7.546474e+04 | 2.992354e+05 | 1.166737e+07 | 9.086304e+04 | 3.099661e+05 | 1.158824e+07 | 8.887256e+04 | 2.780780e+05 | 1.188766e+07 | 8.773011e+04 | ... | 9.397359e+04 | 2.781591e+05 | 1.264051e+07 | 2.197872e+07 | 0.261406 | 63.345006 | 68.752257 | 36.108126 | 43.148705 | 19.870146 |
4 rows × 22 columns
...well, no surprises there.
Our second imported dataset has values only for 2016, whereas the first covers 2015 to 2019. If we want to analyse them together, should we keep the 2019 data from the first dataset, since it's the most recent, or use only the 2016 data so that it matches the time frame of the other dataset? Either approach may be defensible, and in your own data research you'll need to make decisions like this. Just make sure to explain them clearly to yourself and to your colleagues/reviewers/peers/readers, especially by thoroughly annotating and publishing your Jupyter notebooks.
Plotting pandas data¶Plotting with Matplotlib is covered in further detail in a separate notebook, but pandas makes plotting its data extremely straightforward.
Let's first create a simple dataframe:
df = pd.DataFrame(
{
'threes' : [0, 3, 6, 9, 12, 15],
'squares' : [0, 9, 36, 81,144, 225],
'cubes' : [0,27,216,729,1728,3375]
}
)
df
| threes | squares | cubes | |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 3 | 9 | 27 |
| 2 | 6 | 36 | 216 |
| 3 | 9 | 81 | 729 |
| 4 | 12 | 144 | 1728 |
| 5 | 15 | 225 | 3375 |
Plotting it is extremely easy:
fig = df.plot(x='threes', y=['squares','cubes'], kind='line')
By default, pandas uses Matplotlib as its plotting backend. This can be changed to the more interactive Plotly:
pd.options.plotting.backend = "plotly"
Now we can use Plotly just like we used Matplotlib above:
fig = df.plot('threes', ['squares', 'cubes'], kind='scatter') # also try kind='line', or 'bar'
fig.layout.title = "Squares and cubes of three"
fig.show()
You will see that Plotly graphs are dynamic! Hover over the values to see further details in pop-up tooltips. You can also drag a box to zoom in, and double-click to get back to the full view. Click on "cubes" in the legend to hide/show the cubes data...
This barely scratches the surface. Plotly is extremely powerful, and I strongly recommend it if you are considering advanced data visualisation. Here's a final example using the data from the merged World Bank data above:
wbmerged.boxplot(
x='Income Group',
y='CO2 emissions (metric tons per capita)',
category_orders={'Income Group':['Low income', 'Lower middle income', 'Upper middle income', 'High income']},
color = 'Income Group',
color_discrete_sequence = ['rgb(107,174,214)', 'royalblue', 'rgb(8,81,156)', 'rgb(7,40,89)']
)
Sending pandas dataframes to R and vice versa¶You could of course export to TSV or CSV and then import into R in RStudio, or vice versa, and sometimes that's the best solution. But you can in fact send dataframes between the two languages right here within Jupyter. We will cover that in the Interfacing R with Python lesson, but here's a quick demonstration of how easy it is.
First, we load the R extension for Jupyter:
#!conda install -y rpy2 # uncomment to install the R-to-Python bridge (rpy2) in case you don't have it already
%load_ext rpy2.ipython
And now we can declare a full cell as R code by using the %%R Jupyter "magic". We import the dataframe into R by using the -i flag, and from then on we can just do our normal everyday R, including plotting (in base R or ggplot) and such:
%%R -i df
# The following is regular R code!
# You can do anything you would typically do in your R code, including ggplot2, stats, modelling and so on
print(df)
plot(df$threes, df$squares, pch='*', cex=3, col='salmon')
threes squares cubes 0 0 0 0 1 3 9 27 2 6 36 216 3 9 81 729 4 12 144 1728 5 15 225 3375
⚠️ This was only a brief demonstration; please see the Interfacing R with Python notebook for a more thorough introduction.